Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Targeted Gene Metagenomic Data Analysis ◾ 259

7.2.3.2 VSEARCH

VSEARCH [12] is another alignment-based and open-source tool like BLAST. However,

unlike BLAST, which uses seed-based heuristic search algorithm, it uses a fast heuris-

tic search by identifying a small set of database sequences that have many k-mer words

in common with the query sequence (a representative sequence). VSEARCH algorithm

counts the number of shared words between a representative sequence and each data-

base sequence (a word will count once if it appears multiple times). Thus, the similarity

between a representative sequence and a target sequence is based on the statistics of shared

words rather than alignment. Then, the algorithm performs pairwise global alignment

(Needleman-Wunsch) for the query sequence with each sequence of the database begin-

ning with the sequences with the largest number of words in common with the representa-

tive sequence. The optimal global alignment score is computed for each alignment. Unlike

BLAST, exhaustive pairwise alignment is computationally expensive. However, VSEARCH

employs parallel strategies by using vectorization and multiple threading that reduce the

computational cost.

7.2.3.3 Ribosomal Database Project

Ribosomal Database Project (RDP) [13] Classifier is a machine learning taxonomy classifi-

cation method. It is a model trained by known reference sequences with known taxonomic

groups. After training, the classifier is used to predict the taxonomy of the representative

sequences. RDP uses the naïve Bayesian algorithm to accurately predict the taxonomy of a

sequence. The model is trained by N sequences with known taxonomy. First, the algorithm

forms a dictionary from all possible 8-base subsequences or words from all sequences. The

words of all N sequences are called corpus. Let W = {w1, w2, …, wd} be the set of all possible

eight-character words in the corpus and n(wi) be the number of sequences containing word

wi, where i=1, 2, …, d; thus, the expected-likelihood estimate of observing word wi in a

sequence is calculated as

n w

[

]

(

)

(

)

0.5

(7.5)

The numbers 0.5 and 1 are used to keep the value of Pi to be between 0 and 1 (e.g., 0 Pi <1).

Assume that m(wi) is the number of the word wi contained by M training sequences

whose genus is G. Thus, the conditional probability that a sequence of genus G contains the

word wi is estimated as:

P w G

m w

[

]

(

)

(

)

(7.6)

Now assume that a sequence S of genus G has the observed set of words V={v1, v2, …, vf}

V⊑W; thus, the joint probability or likelihood of observing the sequence S is the product of

the probability of each word given genus G as